Goto

Collaborating Authors

 k-nearest neighbor


Distributionally robust weighted k-nearest neighbors

Neural Information Processing Systems

Learning a robust classifier from a few samples remains a key challenge in machine learning. A major thrust of research has been focused on developing k-nearest neighbor (k-NN) based algorithms combined with metric learning that captures similarities between samples. When the samples are limited, robustness is especially crucial to ensure the generalization capability of the classifier. In this paper, we study a minimax distributionally robust formulation of weighted k-nearest neighbors, which aims to find the optimal weighted k-NN classifiers that hedge against feature uncertainties. We develop an algorithm, Dr.k-NN, that efficiently solves this functional optimization problem and features in assigning minimax optimal weights to training samples when performing classification. These weights are class-dependent, and are determined by the similarities of sample features under the least favorable scenarios. When the size of the uncertainty set is properly tuned, the robust classifier has a smaller Lipschitz norm than the vanilla k-NN, and thus improves the generalization capability.


Learning to Clean: Reinforcement Learning for Noisy Label Correction

Heidari, Marzi, Zhang, Hanping, Guo, Yuhong

arXiv.org Artificial Intelligence

The challenge of learning with noisy labels is significant in machine learning, as it can severely degrade the performance of prediction models if not addressed properly. This paper introduces a novel framework that conceptualizes noisy label correction as a reinforcement learning (RL) problem. The proposed approach, Reinforcement Learning for Noisy Label Correction (RLNLC), defines a comprehensive state space representing data and their associated labels, an action space that indicates possible label corrections, and a reward mechanism that evaluates the efficacy of label corrections. RLNLC learns a deep feature representation based policy network to perform label correction through reinforcement learning, utilizing an actor-critic method. The learned policy is subsequently deployed to iteratively correct noisy training labels and facilitate the training of the prediction model. The effectiveness of RLNLC is demonstrated through extensive experiments on multiple benchmark datasets, where it consistently outperforms existing state-of-the-art techniques for learning with noisy labels.


Multi-View Semi-Supervised Label Distribution Learning with Local Structure Complementarity

Xiao, Yanshan, Wu, Kaihong, Liu, Bo

arXiv.org Artificial Intelligence

Label distribution learning (LDL) is a paradigm that each sample is associated with a label distribution. At present, the existing approaches are proposed for the single-view LDL problem with labeled data, while the multi-view LDL problem with labeled and unlabeled data has not been considered. In this paper, we put forward the multi-view semi-supervised label distribution learning with local structure complementarity (MVSS-LDL) approach, which exploits the local nearest neighbor structure of each view and emphasizes the complementarity of local nearest neighbor structures in multiple views. Specifically speaking, we first explore the local structure of view $v$ by computing the $k$-nearest neighbors. As a result, the $k$-nearest neighbor set of each sample $\boldsymbol{x}_i$ in view $v$ is attained. Nevertheless, this $k$-nearest neighbor set describes only a part of the nearest neighbor information of sample $\boldsymbol{x}_i$. In order to obtain a more comprehensive description of sample $\boldsymbol{x}_i$'s nearest neighbors, we complement the nearest neighbor set in view $v$ by incorporating sample $\boldsymbol{x}_i$'s nearest neighbors in other views. Lastly, based on the complemented nearest neighbor set in each view, a graph learning-based multi-view semi-supervised LDL model is constructed. By considering the complementarity of local nearest neighbor structures, different views can mutually provide the local structural information to complement each other. To the best of our knowledge, this is the first attempt at multi-view LDL. Numerical studies have demonstrated that MVSS-LDL attains explicitly better classification performance than the existing single-view LDL methods.



Prediction of Coffee Ratings Based On Influential Attributes Using SelectKBest and Optimal Hyperparameters

Agyemang, Edmund, Agbota, Lawrence, Agbenyeavu, Vincent, Akabuah, Peggy, Bimpong, Bismark, Attafuah, Christopher

arXiv.org Artificial Intelligence

This study explores the application of supervised machine learning algorithms to predict coffee ratings based on a combination of influential textual and numerical attributes extracted from user reviews. Through careful data preprocessing including text cleaning, feature extraction using TF-IDF, and selection with SelectKBest, the study identifies key factors contributing to coffee quality assessments. Six models (Decision Tree, KNearest Neighbors, Multi-layer Perceptron, Random Forest, Extra Trees, and XGBoost) were trained and evaluated using optimized hyperparameters. Model performance was assessed primarily using F1-score, Gmean, and AUC metrics. Results demonstrate that ensemble methods (Extra Trees, Random Forest, and XGBoost), as well as Multi-layer Perceptron, consistently outperform simpler classifiers (Decision Trees and K-Nearest Neighbors) in terms of evaluation metrics such as F1 scores, G-mean and AUC. The findings highlight the essence of rigorous feature selection and hyperparameter tuning in building robust predictive systems for sensory product evaluation, offering a data driven approach to complement traditional coffee cupping by expertise of trained professionals.


ReTrack: Data Unlearning in Diffusion Models through Redirecting the Denoising Trajectory

Shi, Qitan, Jin, Cheng, Zhang, Jiawei, Gu, Yuantao

arXiv.org Artificial Intelligence

Diffusion models excel at generating high-quality, diverse images but suffer from training data memorization, raising critical privacy and safety concerns. Data unlearning has emerged to mitigate this issue by removing the influence of specific data without retraining from scratch. We propose ReTrack, a fast and effective data unlearning method for diffusion models. ReTrack employs importance sampling to construct a more efficient fine-tuning loss, which we approximate by retaining only dominant terms. This yields an interpretable objective that redirects denoising trajectories toward the $k$-nearest neighbors, enabling efficient unlearning while preserving generative quality. Experiments on MNIST T-Shirt, CelebA-HQ, CIFAR-10, and Stable Diffusion show that ReTrack achieves state-of-the-art performance, striking the best trade-off between unlearning strength and generation quality preservation.


Proof of AutoML: SDN based Secure Energy Trading with Blockchain in Disaster Case

Toprak, Salih, Erel-Ozcevik, Muge

arXiv.org Artificial Intelligence

In disaster scenarios where conventional energy infrastructure is compromised, secure and traceable energy trading between solar-powered households and mobile charging units becomes a necessity. To ensure the integrity of such transactions over a blockchain network, robust and unpredictable nonce generation is vital. This study proposes an SDN-enabled architecture where machine learning regressors are leveraged not for their accuracy, but for their potential to generate randomized values suitable as nonce candidates. Therefore, it is newly called Proof of AutoML. Here, SDN allows flexible control over data flows and energy routing policies even in fragmented or degraded networks, ensuring adaptive response during emergencies. Using a 9000-sample dataset, we evaluate five AutoML-selected regression models - Gradient Boosting, LightGBM, Random Forest, Extra Trees, and K-Nearest Neighbors - not by their prediction accuracy, but by their ability to produce diverse and non-deterministic outputs across shuffled data inputs. Randomness analysis reveals that Random Forest and Extra Trees regressors exhibit complete dependency on randomness, whereas Gradient Boosting, K-Nearest Neighbors and LightGBM show strong but slightly lower randomness scores (97.6%, 98.8% and 99.9%, respectively). These findings highlight that certain machine learning models, particularly tree-based ensembles, may serve as effective and lightweight nonce generators within blockchain-secured, SDN-based energy trading infrastructures resilient to disaster conditions.


A Model-Mediated Stacked Ensemble Approach for Depression Prediction Among Professionals

Ahmmed, Md. Mortuza, Noman, Abdullah Al, Afif, Mahin Montasir, Kabir, K. M. Tahsin, Rahman, Md. Mostafizur, Mahmud, Mufti

arXiv.org Artificial Intelligence

Depression is a significant mental health concern, particularly in professional environments where work-related stress, financial pressure, and lifestyle imbalances contribute to deteriorating well-being. Despite increasing awareness, researchers and practitioners face critical challenges in developing accurate and generalizable predictive models for mental health disorders. Traditional classification approaches often struggle with the complexity of depression, as it is influenced by multifaceted, interdependent factors, including occupational stress, sleep patterns, and job satisfaction. This study addresses these challenges by proposing a stacking-based ensemble learning approach to improve the predictive accuracy of depression classification among professionals. The Depression Professional Dataset has been collected from Kaggle. The dataset comprises demographic, occupational, and lifestyle attributes that influence mental well-being. Our stacking model integrates multiple base learners with a logistic regression-mediated model, effectively capturing diverse learning patterns. The experimental results demonstrate that the proposed model achieves high predictive performance, with an accuracy of 99.64% on training data and 98.75% on testing data, with precision, recall, and F1-score all exceeding 98%. These findings highlight the effectiveness of ensemble learning in mental health analytics and underscore its potential for early detection and intervention strategies.


A Comparative Study of Diabetes Prediction Based on Lifestyle Factors Using Machine Learning

Nguyen, Bruce, Zhang, Yan

arXiv.org Artificial Intelligence

Diabetes is a prevalent chronic disease with significant health and economic burdens worldwide. Early prediction and diagnosis can aid in effective management and prevention of complications. This study explores the use of machine learning models to predict diabetes based on lifestyle factors using data from the Behavioral Risk Factor Surveillance System (BRFSS) 2015 survey. The dataset consists of 21 lifestyle and health-related features, capturing aspects such as physical activity, diet, mental health, and socioeconomic status. Three classification models, Decision Tree, K-Nearest Neighbors (KNN), and Logistic Regression, are implemented and evaluated to determine their predictive performance. The models are trained and tested using a balanced dataset, and their performances are assessed based on accuracy, precision, recall, and F1-score. The results indicate that the Decision Tree, KNN, and Logistic Regression achieve an accuracy of 0.74, 0.72, and 0.75, respectively, with varying strengths in precision and recall. The findings highlight the potential of machine learning in diabetes prediction and suggest future improvements through feature selection and ensemble learning techniques.


A Novel Zero-Touch, Zero-Trust, AI/ML Enablement Framework for IoT Network Security

Shakya, Sushil, Abbas, Robert, Maric, Sasa

arXiv.org Artificial Intelligence

The IoT facilitates a connected, intelligent, and sustainable society; therefore, it is imperative to protect the IoT ecosystem. The IoT-based 5G and 6G will leverage the use of machine learning and artificial intelligence (ML/AI) more to pave the way for autonomous and collaborative secure IoT networks. Zero-touch, zero-trust IoT security with AI and machine learning (ML) enablement frameworks offers a powerful approach to securing the expanding landscape of Internet of Things (IoT) devices. This paper presents a novel framework based on the integration of Zero Trust, Zero Touch, and AI/ML powered for the detection, mitigation, and prevention of DDoS attacks in modern IoT ecosystems. The focus will be on the new integrated framework by establishing zero trust for all IoT traffic, fixed and mobile 5G/6G IoT network traffic, and data security (quarantine-zero touch and dynamic policy enforcement). We perform a comparative analysis of five machine learning models, namely, XGBoost, Random Forest, K-Nearest Neighbors, Stochastic Gradient Descent, and Native Bayes, by comparing these models based on accuracy, precision, recall, F1-score, and ROC-AUC. Results show that the best performance in detecting and mitigating different DDoS vectors comes from the ensemble-based approaches.